Red Wine Quality Data Exploration

The wine quality dataset was created in 2009 by Paulo Cortez (Univ. Minho), Antonio Cerdeira, Fernando Almeida, Telmo Matos and Jose Reis (CVRVV), using red wine samples. The inputs are objective tests (e.g. pH values) and the output is based on sensory data (the median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (excellent).

Univariate Plots Section

## [1] 1599   12
##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"              "quality"
##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000

Histograms are a good way to get a first impression of how each attribute is distributed on its own. The plots let me see all features at a glance.

Many of the variables look roughly normally distributed. Chlorides, sulphates, alcohol, free sulfur dioxide and total sulfur dioxide are right-skewed and look closer to lognormal. Let’s exclude the values above the 95th percentile for these five features and re-plot their histograms:

After excluding these outliers, the distributions for chlorides, sulphates, alcohol, free sulfur dioxide, and total sulfur dioxide look much closer to normal.
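The trimming step described above can be sketched in base R as follows. This is a minimal illustration, not the report's original code: `trim_upper` is a hypothetical helper, and the `toy` vector is synthetic data standing in for a skewed column such as chlorides.

```r
# Keep only the values at or below the chosen quantile of a numeric vector.
# `trim_upper` is a hypothetical helper, not a function from the original report.
trim_upper <- function(x, prob = 0.95) {
  cutoff <- quantile(x, probs = prob, na.rm = TRUE)
  x[!is.na(x) & x <= cutoff]
}

# Synthetic right-skewed sample (NOT the wine data), for illustration only.
set.seed(42)
toy <- rlnorm(1000, meanlog = -2.5, sdlog = 0.4)

trimmed <- trim_upper(toy)
length(trimmed)   # roughly 95% of the original values remain
range(trimmed)    # the long right tail has been cut off
```

Passing `trimmed` to `hist()` would then give the re-plotted histogram for that feature.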

Univariate Analysis

What is the structure of your dataset?

Number of red wine instances: 1599

Number of attributes: 1 serial number + 11 attributes + 1 output attribute

11 Attributes:

1 - fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)

2 - volatile acidity: the amount of acetic acid in wine, which at too high a level can lead to an unpleasant, vinegar taste

3 - citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines

4 - residual sugar: the amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet

5 - chlorides: the amount of salt in the wine

6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine

7 - total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine

8 - density: the density of wine is close to that of water, depending on the percent alcohol and sugar content

9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale

10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant

11 - alcohol: the percent alcohol content of the wine

Output variable (based on sensory data):

12 - quality (score between 0 and 10)

What is/are the main feature(s) of interest in your dataset?

Quality is the main feature.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Residual sugar, fixed acidity, pH, density and alcohol content may help support the investigation into the quality.

Did you create any new variables from existing variables in the dataset?

No new variables were created. The first column, named X, contains only serial numbers and carries no statistical meaning, so it has been removed from the original dataset.
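Dropping the serial-number column can be sketched as below. The data frame name `wqr` matches the name visible in the printed output of this report; the `raw` data frame and its values are made up for illustration and do not come from the dataset.

```r
# Toy data frame mimicking the raw file layout: a serial-number column X
# followed by measurements. The values are invented for illustration only.
raw <- data.frame(
  X = 1:3,
  alcohol = c(9.4, 9.8, 10.5),
  quality = c(5L, 5L, 6L)
)

# Drop the serial-number column by name; all other columns are untouched.
wqr <- raw[, setdiff(names(raw), "X"), drop = FALSE]

dim(wqr)     # one column fewer than `raw`
names(wqr)   # X is no longer listed
```

Selecting by name rather than by position keeps the step safe even if the column order in the file changes.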

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Chlorides, total sulfur dioxide, free sulfur dioxide, sulphates and alcohol all appeared long-tailed. They were log-transformed, which revealed an approximately normal distribution for each.
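One way to check the effect of a log transform numerically is to compare skewness before and after. The sketch below assumes a base-10 log (the report does not say which base was used), defines a small hypothetical `skewness` helper because base R does not ship one, and uses synthetic long-tailed values rather than the wine data.

```r
# Sample skewness (third standardized moment); near 0 for a symmetric sample.
# Hypothetical helper, defined here because base R has no skewness function.
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

# Synthetic long-tailed values (NOT the wine data), for illustration only.
set.seed(1)
toy <- rlnorm(2000, meanlog = 3, sdlog = 0.6)

skewness(toy)          # clearly positive: long right tail
skewness(log10(toy))   # close to 0 after the transform
```

A skewness near zero after the transform is consistent with the roughly bell-shaped histograms described above, although it is not a formal normality test.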

Bivariate Plots Section

##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"              "quality"

For the main feature, quality, the positive correlation coefficients of about 0.1 or higher are:

 alcohol:quality = 0.5
 sulphates:quality = 0.3
 citric.acid:quality = 0.2
 fixed.acidity:quality = 0.1

So alcohol content has the strongest positive correlation with red wine quality, although at 0.5 it is only moderate. Other attributes positively correlated with red wine quality include sulphates, citric acid and fixed acidity.

As we can see from the above plot of alcohol content across quality, a large number of samples have a quality score of 5 and about 9.5% alcohol. The samples with a higher quality score also tend to have a higher alcohol percentage.

For the main feature, quality, the negative correlation coefficients of about -0.1 or lower are:

 volatile.acidity:quality = -0.4
 total.sulfur.dioxide:quality = -0.2
 density:quality = -0.2
 chlorides:quality = -0.1

So we see that volatile acidity is negatively correlated with red wine quality, which agrees with the attribute description: at too high a level it can lead to an unpleasant, vinegar taste. Total sulfur dioxide, density and chlorides are also negatively correlated with quality.
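The coefficients listed above are Pearson correlations between each attribute and quality. A minimal base R sketch of how such a ranking can be produced is shown here; `cor_with_target` is a hypothetical helper, and the `toy` data frame is synthetic, so its coefficients are not the report's results.

```r
# Pearson correlation of every other column with a target column,
# sorted from most negative to most positive.
# Hypothetical helper, not a function from the original report.
cor_with_target <- function(df, target = "quality") {
  others <- setdiff(names(df), target)
  r <- sapply(others, function(v) cor(df[[v]], df[[target]]))
  sort(r)
}

# Synthetic data (NOT the wine data) with one positive and one negative driver.
set.seed(7)
n <- 500
alcohol <- rnorm(n, mean = 10.4, sd = 1)
volatile.acidity <- rnorm(n, mean = 0.53, sd = 0.18)
toy <- data.frame(
  alcohol = alcohol,
  volatile.acidity = volatile.acidity,
  quality = 0.5 * alcohol - 2 * volatile.acidity + rnorm(n)
)

round(cor_with_target(toy), 1)
```

On the real data, the same call on the cleaned data frame would give one coefficient per attribute, which can then be filtered at a threshold such as 0.1 in absolute value.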

Besides these, the pairs of other attributes with the strongest (positive or negative) correlations are:

 fixed.acidity:pH = -0.7
 fixed.acidity:citric.acid = 0.7
 fixed.acidity:density = 0.7
 free.sulfur.dioxide:total.sulfur.dioxide = 0.7
 volatile.acidity:citric.acid = -0.6
 citric.acid:pH = -0.5
 density:alcohol = -0.5

pH Attribute

As we know, the more acidic a solution is, the lower its pH will be. So it makes sense that both fixed acidity and citric acid have a strong negative correlation with pH. All three features measure acids, so I expected all of them to lower the pH. However, in the above plot of pH across volatile acidity, the pH increases slightly as volatile acidity increases. From this set of plots, volatile acidity appears to have a weaker effect on pH than the other two, and fixed acidity appears to have the strongest effect.

I will now look at several of the other strongest correlations in a bit more detail.

Acid Attributes

Wine acids play a large role in winemaking, and each acid plays a different part. I would like to see how the three kinds of acid relate to the quality of red wine.

Fixed acidity is a background player, supporting and stabilizing the wine as it evolves. So an increase in fixed acidity does not affect the wine’s quality much, but it helps a little. Malic acid, one of the fixed acids, is high prior to veraison, but as grapes ripen it is used up through respiration. Cooler climates produce grapes with higher levels of malic acid because of the cooler temperatures and lower rates of respiration. In other words, malic acid can be virtually gone in a really hot year. Malic is a harsher acid.

Volatile acidity is mainly acetic acid, which at too high a level can lead to an unpleasant, vinegar taste. So, as the plot shows, the more volatile acidity a wine contains, the lower its quality tends to be.

Citric acid is present in really small amounts compared with the other two acids. This can be seen from the x-axis range of each of the above three plots. The data points in the plot of quality against fixed acidity are concentrated between 6 and 10, which is almost 14 times the values for volatile acidity and almost 32 times the values for citric acid.

Density Attribute

The density of wine partly determines whether it tastes thick or refreshing. According to the attribute description, the density of wine is close to that of water, depending on the percent alcohol and sugar content. I would like to find out how alcohol and sugar content affect density, and also how density relates to quality. The plots show that the density of wine is strongly affected by alcohol and residual sugar: density falls as alcohol content rises, so the wine is more refreshing, and density rises with sugar content, so the wine is thicker. When a wine has too much acidity, winemakers do not necessarily remove the acid; instead they can balance it with residual sugar, which may balance the taste of the wine. As we can see from the plot of quality against density, quality decreases as the wine becomes thicker.

How Volatile Acidity Affects Wine Quality

Volatile acidity has a large negative Pearson correlation coefficient with quality, so I would like to see in more detail how it relates to wine quality. The above boxplot makes the result easy to read: higher values of volatile acidity go with lower quality scores.

At the end of the bivariate analysis, I would like to return to the main feature of the dataset, which is quality. This boxplot shows one of the strongest relationships, the one between quality and alcohol.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

For quality, it appears that wines with higher amounts of alcohol or sulphates also tend to have better quality. However, the amount of volatile acidity is negatively correlated with quality. It is likely that better wines avoid the sour, vinegar-like taste of acetic acid.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

For citric acid, fixed acidity is positively correlated with it, while volatile acidity is negatively correlated with it. For density, fixed acidity is positively correlated with it, while alcohol is negatively correlated with it.

What was the strongest relationship you found?

From the variables analyzed, the strongest relationship was between fixed.acidity and pH, which had a correlation coefficient of -0.68.
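"Strongest" here means largest in absolute value, so the sign has to be carried along separately. The sketch below shows one way to pull the strongest pair out of a correlation matrix in base R; `strongest_pair` is a hypothetical helper and the `toy` data frame is synthetic, so the coefficient it yields is not the report's value.

```r
# Find the pair of columns with the largest absolute Pearson correlation.
# Hypothetical helper, not a function from the original report.
strongest_pair <- function(df) {
  m <- cor(df)
  m[upper.tri(m, diag = TRUE)] <- NA   # keep each pair once, drop the diagonal
  k <- which.max(abs(m))               # NA entries are skipped
  pos <- arrayInd(k, dim(m))
  list(pair = c(rownames(m)[pos[1]], colnames(m)[pos[2]]), r = m[k])
}

# Synthetic data (NOT the wine data): pH falls as fixed acidity rises.
set.seed(3)
fixed.acidity <- rnorm(300, mean = 8.3, sd = 1.7)
toy <- data.frame(
  fixed.acidity = fixed.acidity,
  pH = 3.9 - 0.07 * fixed.acidity + rnorm(300, sd = 0.1),
  alcohol = rnorm(300, mean = 10.4, sd = 1)
)

res <- strongest_pair(toy)
res$pair   # the two column names
res$r      # signed coefficient; negative for an inverse relationship
```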

Multivariate Plots Section

Now let’s visualize the relationship between density, volatile.acidity, alcohol and quality:

In the above scatter plot, a darker blue means a higher quality wine. In other words, white points have the lowest quality and the darkest blue points have the highest quality. Most of the white points appear in the upper left part of the canvas, and the bottom right corner has more blue or dark blue points. This means that most of the wines with higher quality scores have higher alcohol content and also lower volatile acidity.

The faceted plots below try to show how sulphates and alcohol affect the quality of wine.

Interestingly, sulphates also slightly affect the quality of wine. Across the above six plots, the scatter points move slightly to the right. The shift is almost impossible to see between two adjacent plots, but comparing the first plot with the fifth or sixth shows a clear step to the right. So sulphates help a little to increase the quality score.

Next, let’s try to summarize quality using a contour plot of volatile acidity and sulphate content:

Here we can almost tell the result before plotting. Sulphates are positively correlated with quality, while volatile acidity is negatively correlated with quality. So the contours for the highest quality scores should appear at higher values of sulphates and lower values of volatile acidity. The plot shows exactly what we expected.

This shows that higher quality red wines are generally located in the range from 0.25 to 0.65 of citric acid and also at higher alcohol content, above 10.5%. Lower quality red wines generally have lower alcohol, lower citric acid, or both.

Let’s try to summarize quality using a contour plot of density and alcohol content:

From the above plot, we can tell that density does not affect quality much, but alcohol does.

I am trying to use this plot to show the same result as the previous one. However, the latest plot conveys more information within a single plot.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Based on the multivariate analysis, five features stood out to me: alcohol, sulphates, citric acid, volatile acidity, and quality. Volatile acidity between 0.3 and 0.5 combined with sulphates between 0.6 and 0.9 was a strong indicator of good wine. Also, wines with high alcohol content and higher citric acid are more likely to be good.

Final Plots and Summary

Plot One

When analyzing the relationship between quality and the other 11 attributes, the strongest correlation coefficient was found between alcohol and quality.

## # A tibble: 6 x 2
##   quality     n
##     <int> <int>
## 1       3    10
## 2       4    53
## 3       5   681
## 4       6   638
## 5       7   199
## 6       8    18
## wqr$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.400   9.725   9.925   9.955  10.575  11.000 
## -------------------------------------------------------- 
## wqr$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.00    9.60   10.00   10.27   11.00   13.10 
## -------------------------------------------------------- 
## wqr$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     8.5     9.4     9.7     9.9    10.2    14.9 
## -------------------------------------------------------- 
## wqr$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.80   10.50   10.63   11.30   14.00 
## -------------------------------------------------------- 
## wqr$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.20   10.80   11.50   11.47   12.10   14.00 
## -------------------------------------------------------- 
## wqr$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.80   11.32   12.15   12.09   12.88   14.00

Description One

The box plots for higher quality red wines are clearly shifted upward, meaning these wines have a comparatively higher alcohol content than the lower quality red wines.
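The per-quality counts and alcohol summaries printed under Plot One can be reproduced with base R along these lines. This is a sketch under assumptions: the report's tibble output suggests dplyr was used for the counts, whereas `table()` is used here to avoid the dependency, and the `toy` data frame holds a few invented rows rather than the real 1599 samples.

```r
# Toy stand-in for the wine data frame; values are invented for illustration.
toy <- data.frame(
  quality = c(5L, 5L, 6L, 6L, 7L, 7L),
  alcohol = c(9.4, 9.8, 10.5, 10.9, 11.5, 12.1)
)

# Number of samples per quality score.
table(toy$quality)

# Five-number summary plus mean of alcohol within each quality score,
# printed in the same "by group" layout as the output above.
by(toy$alcohol, toy$quality, summary)

# Median alcohol per quality score, handy for checking the upward shift.
meds <- tapply(toy$alcohol, toy$quality, median)
meds
```

If the medians in `meds` increase with the quality score, the box plots will show the upward shift described above.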

Plot Two

Description Two

Observe that wines with lower sulphates content are typically bad wines, with alcohol varying between 9% and 12%. Average wines have higher concentrations of sulphates, and wines that are rated 6 tend to have both higher alcohol content and higher sulphates content. Excellent wines are mostly clustered around higher alcohol contents and higher sulphates contents.

Plot Three

Description Three

This shows that higher quality red wines generally have a higher percentage of alcohol, above 11%, and a slightly lower density, which suggests that more refreshing wines are somewhat more popular. Even with density taken into account, for the low quality levels with scores of 3, 4 and 5 it is hard to tell how alcohol percentage affects quality. For the high levels of 6, 7 and 8, it is obvious that more alcohol goes with better wine quality.


Reflection

The red wine dataset contains information on 1,599 red wine instances, with 11 attributes and one output attribute. Initially, I tried to get a sense of how each attribute is distributed on its own. All univariate plots have been arranged together. Many of the variables look normally distributed. However, some of the features have lognormal distributions. I excluded the values above the 95th percentile for these features and re-plotted their histograms.

Then, I tried to find what factors might affect the quality of the wine. At this point, the Pearson correlation coefficient helped to quantify the relationship between each pair of variables. Using the insights from the correlation coefficients provided by the paired plots, it was interesting to explore quality using box plots with a different color for each quality. Besides, melting the data frame and using facet grids was really helpful for visualizing the distribution of the parameters with scatter plots. Finally, a contour plot of wine quality over a point plot of volatile acidity and alcohol was a good choice to show that either lower volatile acidity or higher alcohol makes a better wine more likely. The result makes sense. Volatile acidity is the amount of acetic acid in wine, which is mostly produced by bacteria. It can lead to an unpleasant, vinegar taste at too high a level.

The hardest part for me was understanding all the features with the help of Wikipedia and other documents. But to be a good data analyst, we must study and understand the data structure as much as we can. In the end, I figured out all the attributes of red wine.

The dataset could be improved by including more features of the environment where the grapes are grown. As we all know, location and temperature play important roles in the quality of wine.

Citation

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009. ISSN: 0167-9236.